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Abstract 

Background: The least absolute shrinkage and selection operator (LASSO) can be used to predict SNP effects. This 
operator has the desirable feature of including in the model only a subset of explanatory SNPs, which can be 
useful both in QTL detection and GWS studies. LASSO solutions can be obtained by the least angle regression 
(LARS) algorithm. The big issue with this procedure is to define the best constraint (t), i.e. the upper bound of the 
sum of absolute value of the SNP effects which roughly corresponds to the number of SNPs to be selected. Usai et 
al. (2009) dealt with this problem by a cross-validation approach and defined f as the average number of selected 
SNPs overall replications. Nevertheless, in small size populations, such estimator could give underestimated values 
of t. Here we propose two alternative ways to define f and compared them with the "classical" one. 

Methods: The first (strategy 1), was based on 1,000 cross-validations carried out by randomly splitting the 
reference population (2,000 individuals with performance) into two halves. The value of t was the number of SNPs 
which occurred in more than 5% of replications. The second (strategy 2), which did not use cross-validations, was 
based on the minimization of the Cp-type selection criterion which depends on the number of selected SNPs and 
the expected residual variance. 

Results: The size of the subset of selected SNPs was 46, 189 and 64 for the classical approach, strategy 1 and 2 
respectively. Classical and strategy 2 gave similar results and indicated quite clearly the regions were QTL with 
additive effects were located. Strategy 1 confirmed such regions and added further positions which gave a less 
clear scenario. Correlation between GEBVs estimated with the three strategies and TBVs in progenies without 
phenotypes were 0.9237, 0.9000 and 0.9240 for classical, strategy 1 and 2 respectively. 

Conclusions: This suggests that the Cp-type selection criterion is a valid alternative to the cross-validations to 
define the best constraint for selecting subsets of predicting SNPs by LASSO-LARS procedure. 



Background 

A method to estimate the SNP (Single Nucleotide Poly- 
mophism) effects would be to use the least absolute 
shrinkage and selection operator (LASSO) approach [1]. 
This operator has the desirable feature of including in 
the model only a subset of explanatory variables, setting 
to zero those that have nil effects. This agrees with the 
assumption that many chromosome segments will not 
contain QTL (Quantitative Trait Locus) and therefore 
have zero effect, and only few are real QTL [2]. The 
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LASSO is a constrained version of ordinary least squares 
which minimizes the residual sum of squares constrain- 
ing the sum of absolute values of the regression coeffi- 
cients. Then the LASSO solution is the set of SNP 
effects that satisfy: 



^2 I Yi - y.JCijffi I | subject to ^ < t for t > 0 



Where jy, is the phenotype of the t individual; Xij is 
the genotype of the i th individual at the / marker; ftj is 
the allelic substitution effect for the f marker and t is 
the constraint that allows some estimated SNP effects to 
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be exactly zero. The challenge with implementing the 
LASSO approach is how to best choose the constraint 
parameter (t) which in turn depends on the size of the 
subset of explanatory variables, in this case the number 
of SNPs [3]. 

The LASSO problem can be solved by quadratic pro- 
gramming [1] or by Bayesian approaches [4]. The latter 
have been implemented in several genome wise selection 
(GWS) and QTL detection studies [5,6]. An alternative 
way to produce LASSO solution is a modified version of 
the Least Angle Regression (LARS) algorithm [7]. This 
procedure is a version of traditional forward selection 
methods which estimates the effects by successive itera- 
tions. For each iteration the SNP with the highest absolute 
correlation between genotypes and current residuals is 
added to the model. To obtain LASSO solutions the LARS 
procedure is modified so that either addition or subtrac- 
tion of one marker to the model per iteration may occur. 
Usai et al. [8] suggested the LASSO-LARS method to esti- 
mate the marker effects for genomic selection including a 
cross-validation step to define the best constraint t. With 
this approach good results were obtained both in simu- 
lated and real dataset. Nonetheless one limitation of 
LASSO-LARS is that, as a constrained version of the 
ordinary least squares, it cannot estimate the effects for a 
number of markers larger than the number of individuals 
in the reference population. Thus in real data, where the 
reference population size is often relatively limited and for 
those traits affected by a large number of QTL, LASSO- 
LARS may not be able to predict the effect of all the QTL 
contributing to the total genetic variability. Moreover, by 
using cross-validations at each replicate the training sam- 
ple where SNP effects are estimated is only a portion of 
the total reference population. So the number of SNPs 
selected at each replication will be smaller than expected 
and their average overall replications will underestimate 
the true value of the best t. 

In this study we propose two alternative strategies to 
define the t and compared them with the classical strat- 
egy [8]. 

Methods 

Data 

A simulated data set of 3,220 individuals generated for 
the 15th QTL-MAS workshop was used. The first gen- 
eration consisted in 220 founders (20 males and 200). 
The second generation consisted in 3,000 individuals 
organized in 20 sire families of 150 individuals each and 
200 dam families of 15 individuals each. The dam 
families were nested in the sire families. The genome 
consisted in five chromosomes. Each chromosome was 1 
Morgan long and carried 1998 SNPs evenly distributed. 
Genotypes were available for all the individuals. Pheno- 
types were available only for 2,000 progenies (1/3 of each 



sire and dam family) which represented the reference 
population. The further 1,000 progenies had genotypes 
but no phenotypes and represented the candidate 
population. 

LASSO-LARS classical strategy 

At each cross-validation replication the reference popula- 
tion was randomly split into training (T) and validation 
(V) samples of equal size. This strategy corresponded to 
that suggested by Usai et al. [8] where the 50 % random 
splitting was chosen since it gave the lowest t variability 
and the highest accuracy of GEBV estimates. At each 
replication LASSO-LARS was run on T sample. At each 
step of the procedure the genomic breeding values 
(GEBV) of V was updated by the current set of SNP effects 
estimated on T. When the maximum of the correlation 
between the GEBVs and phenotypes of V was reached the 
LASSO-LARS was stopped and the number and the iden- 
tity of the active SNPs were retained. The procedure was 
replicate 1,000 times. Afterwards the average number of 
active SNPs was taken as the best value of t. Moreover the 
SNP frequency of occurrence (fo) was defined as the pro- 
portion of times that a SNP was selected (i.e. with non 
zero effect) over all replications. 

Strategy 1 

The only difference with respect to the classical strategy 
was that here the best t was defined as the number of 
SNPs which occurred in the cross-validations more than 
5% of the times. This strategy was based on the assump- 
tion that if a SNP is selected more than expected by 
chance it is probable that it affects or is linked to QTL 
affecting the traits. The probability of a SNP to be selected 
by chance was estimated by permuting 10,000 times phe- 
notypes on genotypes and by running LASSO-LARS until 
the best t estimated for classical strategy was attained. The 
frequency of occurrence of 5% corresponded to a probabil- 
ity lower than 0.0001. 

Strategy 2 

In this strategy the best t was the number of active 
SNPs which minimize the value of the Cp-type selection 
criterion [7]. Such parameter is commonly used as a 
stopping rule for various forms of stepwise regression. 
At each k th LASSO-LARS step Cp-type was calculated 
as: 

n 2 

E in - Yd 

Cp k = = n + 2df k 

where y t and y t are, respectively, the phenotype and 
the current predicted value of the i th individual; n is the 
number of individual; df is the degree of freedom which 
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here corresponded to the number of active SNPs and 
(T e 2 is the residual variance of the complete model, i.e. 
the model including all SNPs. The latter was estimated 
by a REML procedure running a GBLUP animal model 
with ASREML software [9]. The genomic relationship 
matrix was built as described by Hayes et al. in 2009 
[10]. Since this strategy did not require cross-validations, 
LASSO-LARS was run on the whole reference popula- 
tion until the minimum Cp-type value was reached and 
the corresponding number of active SNPs was taken as 
best t. 

GEBV estimation 

Once the best t were defined, LASSO-LARS was run on 
the whole population. For each strategy the procedure 
was stopped when the corresponding t SNPs were in 
the model. The estimated allelic substitution effects of 
the selected SNPs were used to calculate the GEBVs on 
the candidate population (1,000 progenies without phe- 
notypes). The GEBV accuracy, defined as the correlation 
coefficients between true breeding values (TBVs) and 
GEBVs, and the regression coefficient of the TBVs on 
GEBVs were calculated for each strategy. 

Results 

Best t definition 

The number of SNPs selected at each cross-validation 
replication ranged from 15 to 92 and was on average 46. 
This value was taken as best t for the classical strategy. 
The maximum of correlation between GEBVs and 



phenotypes in the validation sample was 0.493 on average 
and the corresponding R 2 was 0.298. Among the 9,990 
available SNPs only 2,169 occurred at least once overall 
cross-validations and 189 occurred more than 5% of the 
times. The latter value was taken as best t for strategyl. 
Figure 1 depicts the profile of the Cp-type criterion for an 
increasing number of active SNPs. The minimum was 
reached when 64 SNPs were selected and this value was 
taken as best t for strategy2. 

QTL mapping 

Figure 2 shows the absolute values of the effect of the 
SNPs selected by LASSO-LARS given the t values 
defined by the different strategies. The SNP fo in the 
cross-validations is also reported. The average fo of the 
SNPs selected by classical strategy was 39.8 % and ran- 
ged from 10.4 % to 99.7%. For strategyl the fo was 
11.7 % on average, ranging from 0 to 99.7%. In this case 
107 SNPs had an fo lower than 5% and 16 never 
occurred in the cross-validations. The SNPs selected 
with strategy2 showed an average fo of 30.8% ranging 
from 4.4% to 99.7%. Only 3 SNPs had/o slightly lower 
then the 5% threshold. Concerning the similarity 
between the three SNP subsets, 39 common SNPs were 
selected by classical and strategyl; 44 SNPs were 
selected by classical and strategy2 and 55 were selected 
by strategyl and strategy2. On the whole 38 SNPs were 
selected by all three strategies. All the three strategies 
identified quite clearly the five QTL with additive 
effects. A less clear localization was observed for the 
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Figure 2 Comparison of SNP effects estimated by classical, strategyl and strategy2. SNP frequency of occurrence True QTL positions. 



QTL with imprinted effect and the first QTL with epi- 
static action. None of the strategies was able to selected 
SNPs linked to the second epistatic QTL. These latter 
results were expected since LASSO-LARS only accounts 
for additive QTL effects. Most of the SNPs selected by 
classical and strategy2 were concentrated on the true 
QTL position. Finally the number of false detections 
enlarged as the SNP subset size increased and was parti- 
cularly high for strategyl where a very unclear scenario 
was observed. 

GEBV estimation 

The candidate population GEBV accuracies corresponding 
to the three t estimation strategies are shown in Table 1. 
Not relevant differences were observed among strategies. 
In particular, the accuracies obtained by classical and 
strategy2 were very similar and both outperformed strat- 
egyl of more than 2%. Table 1 also shows the regression 
of TBVs on GEBVs, indicating that both classical and 



Table 1 Genomic breeding value (GEBV) accuracy (r) and 
regression coefficient (b) of true breeding value (TBV) on 
GEBV for the three tested strategies 



Strategy 


r(TBV,GEBV) 


b(TBV,GEBV) 


Classical 


0.9237 


1.2512 


Strategyl 


0.9000 


1 .0220 


Strategy2 


0.9240 


1.1877 



strategy2 GEBVs underestimate the TBVs, whilst strategyl 
gave a regression coefficient considerably closer (around 
25%) to the target value 1. However, it is important to 
point out that some of the QTL had simulated epistatic 
and imprinted actions, and LASSO-LARS did not account 
for this. 

Discussion 

Our results demonstrated that LASSO-LARS performs 
well estimating SNPs associated to QTL with additive 
effects. The detection of QTL with different action was 
rather poor. However it suggests the presence of the 
imprinted QTL and of the first epistatic QTL. The second 
epistatic QTL was neglected since LASSO-LARS just 
selects the SNPs which underline the main portion of the 
variability explained by both QTL. Concerning the choice 
of the best constraint for LASSO-LARS, classical and 
strategy2 although based on different procedures gave very 
similar results. This suggests that a valid estimation of the 
best constraint can be obtained without cross-validation 
with a large computing time saving. Indeed, while the 
cross-validation procedure took 3 hours and 35 minutes, 
strategy2 just took 8 seconds. Nevertheless, the current 
data set did not allowed to verify if the constraint estima- 
tion based on Cp-type minimization can overcome the 
underestimation of t expected with cross-validation. Thus 
a study based on a dataset with high ratio between number 
of QTL and reference population size is needed. Strategyl 
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seems in general the worst, since most of the selected 
SNPs did not correspond to those with the highest fo. This 
happened because the complementary among SNPs 
selected in the cross-validation was not accounted for. 
Indeed if two SNPs are strongly correlated and equally 
correlated with the phenotype, they could be alternatively 
selected in the cross-validation due to the random sam- 
pling. Nevertheless they explain the same portion of var- 
iance and when LASSO-LARS runs on the whole 
reference population only one of them is selected. In fact 
most of the SNPs further selected by strategyl respect to 
the other two were false positives (Figure 2). The presence 
of many false positives leads to a lower GEBV accuracy. 
The regression of TBVs on GEBVs close to 1 obtained by 
strategyl could be due to the higher weight given to the 
QTL with imprinted and epistatic effects. 

Conclusions 

We conclude that the strategy based on the Cp-type 
selection criterion is a valid alternative to the cross-vali- 
dations to define the best constraint for selecting subsets 
of predicting SNPs by LASSO-LARS procedure. 
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